Overview:
- Apache Avro is an open-source data serialization framework maintained by the Apache Software Foundation.
- In Apache Avro, messages, data structures, or simply data are defined using schemas.
- During serialization, the data is written together with its schema, and this can be done through the library APIs alone, without any generated code. This is in contrast with similar serialization frameworks such as Protocol Buffers, which typically rely on generated classes.
- As a schema evolves over time, the schema used to write the data is stored along with it, so a reader using a newer version of the schema can resolve the differences between the two versions.
- Because the schema always accompanies the data, the serialized records themselves can stay compact: field names and type tags do not need to be repeated for every record, which yields significant performance gains.
- Apache Avro schemas are written in JSON (JavaScript Object Notation).
How to define data types using Apache Avro:
- Apache Avro schemas consist of two kinds of types:
- Primitive types
- Complex types
- The following complex types can be defined in an Apache Avro schema:
- record
- enum
- array
- map
- union
- fixed
using the following primitive types:
- int
- long
- float
- double
- string
- bytes
- boolean
- null
- Three of the complex types (record, enum, and fixed) are named types; array, map, and union are not. Depending on its kind, a complex type also carries additional information, such as the fields of a record or the item type of an array.
- The complex type fixed shows Apache Avro's emphasis on compactness: it declares a value that always occupies an exact, fixed number of bytes, so no length information has to be stored with the data.
- The Python example in this article defines a record type named Conference and serializes data using Apache Avro.
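The fixed type mentioned above declares its exact size in the schema itself. The schema below is only a hypothetical illustration (the name MD5 is not part of this article's Conference schema):

```json
{"type": "fixed", "name": "MD5", "namespace": "demo.avro", "size": 16}
```

A value of this type always occupies exactly 16 bytes on the wire, with no length prefix stored alongside it.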
Serialization process in Apache Avro:
- Apache Avro offers two types of serialization formats:
- Binary format - For production use
- JSON format - For debugging purposes
and this article will focus on the binary format.
- Some significant observations on the binary encoding scheme used by Apache Avro:
- Avro uses variable-length zig-zag encoding for int and long values.
- Encoding and decoding happens through a depth-first, left-to-right traversal of the schema.
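The zig-zag/varint scheme used for int and long values can be sketched in pure Python. The function names below are illustrative only and are not part of the avro library:

```python
def encode_long(n):
    """Encode a signed integer the way Avro encodes int/long values."""
    n = (n << 1) ^ (n >> 63)           # zig-zag: 0,-1,1,-2,... -> 0,1,2,3,...
    out = bytearray()
    while n & ~0x7F:                   # emit 7 bits at a time, low bits first
        out.append((n & 0x7F) | 0x80)  # high bit set = more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)

def decode_long(data):
    """Decode the variable-length zig-zag encoding back to a signed integer."""
    n = shift = 0
    for b in data:
        n |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:               # high bit clear = last byte
            break
    return (n >> 1) ^ -(n & 1)         # undo zig-zag

# Small values of either sign fit in a single byte:
print(encode_long(-1))   # b'\x01'
print(encode_long(64))   # b'\x80\x01'
```

Because small magnitudes of either sign map to small unsigned values, common integers serialize to just one or two bytes.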
Example:
An Apache Avro Schema:
{
  "namespace": "demo.avro",
  "type": "record",
  "name": "Conference",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "date", "type": "long"},
    {"name": "location", "type": "string"},
    {"name": "speakers", "type": {"type": "array", "items": "string"}},
    {"name": "participants", "type": {"type": "array", "items": "string"}},
    {"name": "seatingArrangement", "type": {"type": "map", "values": "int"}}
  ]
}
A Python Program serializing data using Apache Avro:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# Parse the schema file
schema = avro.schema.parse(open("demo.avsc").read())

# Create a data file using DataFileWriter
dataFile = open("participants.avro", "wb")
writer = DataFileWriter(dataFile, DatumWriter(), schema)

# Write data using DatumWriter
writer.append({
    "name": "Virtual conference",
    "date": 25612345,
    "location": "New York",
    "speakers": ["Speaker1", "Speaker2"],
    "participants": ["Participant1", "Participant2", "Participant3",
                     "Participant4", "Participant5"],
    "seatingArrangement": {"Participant1": 1, "Participant2": 2,
                           "Participant3": 3, "Participant4": 4,
                           "Participant5": 5},
})
writer.close()
Output:
Objavro.codecnullavro.schema�{"type": "record", "name": "Conference", "namespace": "demo.avro", "fields": [{"type": "string", "name": "name"}, {"type": "long", "name": "date"}, {"type": "string", "name": "location"}, {"type": {"type": "array", "items": "string"}, "name": "speakers"}, {"type": {"type": "array", "items": "string"}, "name": "participants"}, {"type": {"type": "map", "values": "int"}, "name": "seatingArrangement"}]}3��Zdy